To allow comparison of gene expression with a metadata variable of interest, BITHub contains comprehensive metadata annotations of the curated datasets. Three main categories of annotation are present in BITHub:
Phenotype annotations: These relate to the phenotype and origin of the sample, including sex, donor ID, age of donor, diagnosis and ethnicity.
Sequencing metrics: These relate to the quality of the RNA-seq samples and how they were experimentally prepared (if available).
Sample characteristics: These describe the samples themselves, such as the sample identifier and the brain region or structure sampled.
In order to ensure the metadata information is displayed in a user-friendly manner, highly correlated metadata annotations will be removed and a subset will be used for the site.
Metadata correlations were computed after processing the raw metadata and expression files. If you are interested in that part of the pipeline, please refer to the README.md in the GitHub repo.
Before running the cor() function, the FQCbasicStats, perSeqQual, SeqLengthDist and KmerContent columns were removed: each contained a single repeated value, so its correlation with any other column is undefined and would be returned as NA.
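The pipeline itself uses R's cor(); the following is only a minimal Python sketch of the same idea, showing why a constant column yields an undefined correlation and how such columns can be dropped first. The helper names (`drop_constant_columns`, `pearson`, `correlation_matrix`) and all values are invented for illustration and are not taken from the BITHub code.

```python
import math

def drop_constant_columns(table):
    """Return a copy of `table` without columns that hold a single repeated value."""
    return {name: values for name, values in table.items() if len(set(values)) > 1}

def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # the denominator would be zero
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(table):
    """All pairwise correlations as a nested dict."""
    names = list(table)
    return {a: {b: pearson(table[a], table[b]) for b in names} for a in names}

# Toy metadata: column names follow the text, values are made up.
metadata = {
    "RIN": [7.1, 8.3, 6.5, 9.0],
    "numReads": [10.0, 20.0, 30.0, 40.0],
    "numMapped": [9.0, 18.0, 27.0, 36.0],
    "KmerContent": [1.0, 1.0, 1.0, 1.0],  # constant column
}

filtered = drop_constant_columns(metadata)
corr = correlation_matrix(filtered)
print(sorted(filtered))
```

Removing the constant columns before computing the matrix keeps the result free of undefined entries, which matters when the matrix is later clustered for plotting.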
BrainSeq metadata annotations show duplicate information in many columns (e.g. SampleID, SAMPLEID), which is likely a result of running the pre-processing pipeline for BITHub. Additionally, certain columns contain very similar information and are therefore highly correlated. Several RNA-seq QC metrics also provide redundant information, so these will be removed for downstream analysis.
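One simple way to prune redundant annotations is a greedy pass that keeps a column only if it is not strongly correlated with any column already kept. The sketch below is an illustration in Python, not the procedure used for BITHub: the 0.9 threshold, the function names and the toy values are all assumptions, and the source does not state how columns were actually chosen for removal.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def select_uncorrelated(table, threshold=0.9):
    """Greedily keep columns whose |r| with every kept column is <= threshold.

    Columns are visited in insertion order, so the first member of a
    redundant group is the one that survives. The threshold is an assumed value.
    """
    kept = []
    for name in table:
        if all(abs(pearson(table[name], table[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy metadata with one near-duplicate pair (values are made up).
metadata = {
    "numReads": [10.0, 20.0, 30.0, 40.0],
    "numMapped": [9.0, 18.0, 27.0, 37.0],  # almost proportional to numReads
    "RIN": [7.1, 8.3, 6.5, 9.0],
}

kept = select_uncorrelated(metadata)
print(kept)
```

Because the result depends on column order, a real pipeline would usually order the columns by how informative or complete they are before applying a filter like this.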
Correlation plot of metadata annotations from BrainSeq Phase II. The metadata annotations are clustered based on correlation.
The final BrainSeq metadata annotations will contain the following columns:
| OriginalMetadataColumnName | BITColumnName | Type |
|---|---|---|
| X | SampleID | Sample characteristics |
| trimmed | trimmed | Sequencing metrics |
| numReads | TotalNReads | Sequencing metrics |
| numMapped | numMapped | Sequencing metrics |
| numUnmapped | numUnmapped | Sequencing metrics |
| overallMapRate | MappingRate | Sequencing metrics |
| concordMapRate | concordMapRate | Sequencing metrics |
| totalMapped | totalMapped | Sequencing metrics |
| mitoMapped | mitoMapped | Sequencing metrics |
| mitoRate | mito_Rate | Sequencing metrics |
| totalAssignedGene | totalAssignedGene | Sequencing metrics |
| rRNA_rate | rRNA_rate | Sequencing metrics |
| RNum | SampleID | Phenotype |
| Region | StructureAcronym | Sample characteristics |
| RIN | RIN | Sequencing metrics |
| Age | AgeNumeric | Phenotype |
| Sex | Sex | Phenotype |
| Race | Ethnicity | Phenotype |
| Dx | Diagnosis | Phenotype |
| Fetal_replicating | Dev.Replicating | Sample characteristics |
| Fetal_quiescent | Dev.Quiescent | Sample characteristics |
| OPC | Adult.OPC | Sample characteristics |
| Neurons | Adult.Neurons | Sample characteristics |
| Astrocytes | Adult.Astrocytes | Sample characteristics |
| Oligodendrocytes | Adult.Oligo | Sample characteristics |
| Microglia | Adult.Microglia | Sample characteristics |
| Endothelial | Adult.Endothelial | Sample characteristics |
| NA | AgeInterval | Phenotype |
| NA | Period | Phenotype |
| NA | Regions | Sample characteristics |
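Applying a mapping like the one in the table amounts to renaming keys. The following Python sketch shows this for a subset of the BrainSeq columns; the function name `harmonise_columns` and the sample record are hypothetical, and the actual renaming in the BITHub pipeline may be implemented differently.

```python
# Subset of the BrainSeq mapping listed in the table (original name -> BITHub name).
BRAINSEQ_COLUMN_MAP = {
    "RNum": "SampleID",
    "numReads": "TotalNReads",
    "overallMapRate": "MappingRate",
    "mitoRate": "mito_Rate",
    "Region": "StructureAcronym",
    "Age": "AgeNumeric",
    "Race": "Ethnicity",
    "Dx": "Diagnosis",
}

def harmonise_columns(record, column_map):
    """Rename keys listed in `column_map`; keys without a mapping keep their name."""
    return {column_map.get(name, name): value for name, value in record.items()}

# Hypothetical sample record; the values are invented for illustration.
raw = {"RNum": "R0001", "Dx": "Control", "Age": 42.0, "RIN": 8.1}
harmonised = harmonise_columns(raw, BRAINSEQ_COLUMN_MAP)
print(harmonised)
```

Keeping the mapping in a single dictionary per dataset makes it straightforward to give every dataset the same BITHub column names before they are displayed side by side.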
BrainSpan metadata annotations contain several duplicate and redundant columns that essentially carry the same information (e.g. column_num, Age.x, Braincode). BrainSpan annotations were retrieved from multiple sources, which may have led to the same annotation appearing under different column names.
Correlation plot of metadata annotations from BrainSpan. The metadata annotations are clustered based on correlation.
The following BrainSpan metadata annotations will be used for BITHub:
| OriginalMetadataColumnName | BITColumnName | Type |
|---|---|---|
| SampleID | SampleID | Sample characteristics |
| gender | Sex | Phenotype |
| structure_acronym | StructureAcronym | Sample characteristics |
| NA | Period | Phenotype |
| NA | AgeNumeric | Phenotype |
| NA | AgeInterval | Phenotype |
| NA | Diagnosis | Phenotype |
| NA | Regions | Sample characteristics |
| NA | mRIN | Sequencing metrics |
| Hemisphere | Hemisphere | Sample characteristics |
| RIN | RIN | Sequencing metrics |
| PMI | PMI | Sequencing metrics |
| pH | pH | Sequencing metrics |
| Ethnicity | Ethnicity | Phenotype |
The GTEx metadata contains comprehensive annotations of sample, sequencing and phenotype attributes. However, redundant and strongly correlated annotations, particularly for sequencing metrics, will be removed.
Correlation plot of metadata annotations of brain samples from GTEx. The correlations are plotted as numeric values to allow a better overview of the data.
The above figure shows NA values on the correlation plot. These arise either from missing values in certain columns or from columns whose values do not vary, for which a correlation cannot be computed. Annotations that are highly correlated within the same category (e.g. ReadsMapped and TotalNReads) will be removed. Annotations with too many missing values will also be removed, as these would impact the readability of the plots on BITHub.
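Both sources of NA can be reproduced on a small example. The Python sketch below is an illustration only: it represents missing values as `None`, computes correlations over the samples observed in both columns, and drops columns above an assumed missing-value fraction of 0.5. The source does not state which threshold or which missing-value handling was used for GTEx, and the values are made up.

```python
import math

def missing_fraction(values):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)

def drop_sparse_columns(table, max_missing=0.5):
    """Keep columns whose missing fraction does not exceed `max_missing` (assumed threshold)."""
    return {name: vals for name, vals in table.items()
            if missing_fraction(vals) <= max_missing}

def pairwise_pearson(x, y):
    """Pearson r over samples observed in both columns; NaN if it cannot be computed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 2:
        return float("nan")  # too few complete observations
    xs, ys = zip(*pairs)
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # no difference in values
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    return cov / math.sqrt(var_x * var_y)

# Toy GTEx-like metadata using BITHub column names; values are invented.
gtex = {
    "RIN": [7.0, 8.0, None, 6.0],
    "Time_spent_in_PAXgene_fixative": [None, None, None, 500.0],
    "TotalNReads": [10.0, 20.0, 30.0, 40.0],
    "ReadsMapped": [9.0, 18.0, 27.0, 36.0],
}

dense = drop_sparse_columns(gtex)
print(list(dense))
```

In this toy table the mostly empty column is dropped by the missing-value filter, while the perfectly correlated pair `TotalNReads` and `ReadsMapped` would be caught by a separate correlation filter.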
The following metadata annotations will be used for GTEx:
| OriginalMetadataColumnName | BITColumnName | Type |
|---|---|---|
| SAMPID | SampleID | Sample characteristics |
| SMRIN | RIN | Sequencing metrics |
| SMTSISCH | PMI | Sequencing metrics |
| AGE | AgeInterval | Phenotype |
| SEX | Sex | Phenotype |
| SMATSSCR | AutolysisScore | Sample characteristics |
| SMNABTCH | IsolationBatchID | Sample characteristics |
| SMNABTCHT | TypeofBatch | Sample characteristics |
| SMNABTCHD | DateofBatch | Sample characteristics |
| SMGEBTCH | Genotype_or_Expression_Batch_ID | Sample characteristics |
| SMGEBTCHD | DateofGenotypeorExpressionBatch | Sample characteristics |
| SMGEBTCHT | TypeofGenotypeorExpressionBatch | Sample characteristics |
| SMCENTER | BSS_Collection_side_code | Sample characteristics |
| SMTS | Regions | Sample characteristics |
| SMTSD | StructureAcronym | Sample characteristics |
| SMTSPAX | Time_spent_in_PAXgene_fixative | Sequencing metrics |
| SME2MPRT | End_2_mapping_rate | Sequencing metrics |
| SMCHMPRS | ChimericPairs | Sequencing metrics |
| SMNTRART | IntragenicRate | Sequencing metrics |
| SMNUMGPS | No_of_Gaps | Sequencing metrics |
| SMMAPRT | MappingRate_total | Sequencing metrics |
| SMEXNCRT | ExonicRate | Sequencing metrics |
| SM550NRM | BasedNormalised | Sequencing metrics |
| SMGNSDTC | GenesDetected | Sequencing metrics |
| SMUNMPRT | Rate_of_mapped_genes_unique | Sequencing metrics |
| SM350NRM | BaseNormilization | Sequencing metrics |
| SMESTLBS | LibrarySize | Sequencing metrics |
| SMMPPD | ReadsMapped | Sequencing metrics |
| SMNTERRT | IntergenicRate | Sequencing metrics |
| SMRRNANM | rRNA | Sequencing metrics |
| SMRDTTL | TotalNReads | Sequencing metrics |
| SMMNCV | Mean_Coeff_Variation | Sequencing metrics |
| SMTRSCPT | TranscriptsDetected | Sequencing metrics |
| SMMPPDPR | MappedPairs | Sequencing metrics |
| SMUNPDRD | UnpairedReads | Sequencing metrics |
| SMNTRNRT | IntronicRate | Sequencing metrics |
| SMMPUNRT | Mapped_unique_rate_of_total | Sequencing metrics |
| SMEXPEFF | ExpressionProfilingEfficiency | Sequencing metrics |
| SMMPPDUN | MappedUnique_no_dup_flags | Sequencing metrics |
| SME2MMRT | End_2_Mismatch_Rate | Sequencing metrics |
| SME2ANTI | End_2_Antisense | Sequencing metrics |
| SME2SNSE | End_Sense_2 | Sequencing metrics |
| SME1ANTI | End_1_Antisense | Sequencing metrics |
| SME1SNSE | End_1_Sense | Sequencing metrics |
| SME1PCTS | End_1_Sense_percentage | Sequencing metrics |
| SMRRNART | rRNA_rate | Sequencing metrics |
| SME1MPRT | End_1_Mapping_rate | Sequencing metrics |
| SMNUM5CD | Num_of_Reads_Covered_5prime | Sequencing metrics |
| SMDPMPRT | DuplicationRateMapped | Sequencing metrics |
| SME2PCTS | Percentage_IntragenicEnd_2_Reads | Sequencing metrics |
| DTHHRDY | HardyScale | Phenotype |
PsychEncode metadata annotations contain limited information on the sequencing metrics. Additionally, some metadata annotations carry similar information and are therefore highly correlated. These include Row_IDs, Row_Versions, Contributing Studies and Notes. These columns will be removed for BITHub.
Correlation plot of metadata annotations from PsychEncode. Due to missing values in the correlation matrix, the annotations could not be clustered.
The following metadata annotations will be retained for PsychEncode:
| OriginalMetadataColumnName | BITColumnName | Type |
|---|---|---|
| individualID | SampleID | Sample characteristics |
| diagnosis | Diagnosis | Phenotype |
| sex | Sex | Phenotype |
| ethnicity | Ethnicity | Phenotype |
| ageDeath | AgeNumeric | Phenotype |
| Adult.Ex1 | Adult.Ex1 | Sample characteristics |
| Adult.Ex2 | Adult.Ex2 | Sample characteristics |
| Adult.Ex3 | Adult.Ex3 | Sample characteristics |
| Adult.Ex4 | Adult.Ex4 | Sample characteristics |
| Adult.Ex5 | Adult.Ex5 | Sample characteristics |
| Adult.Ex6 | Adult.Ex6 | Sample characteristics |
| Adult.Ex7 | Adult.Ex7 | Sample characteristics |
| Adult.Ex8 | Adult.Ex8 | Sample characteristics |
| Adult.In1 | Adult.In1 | Sample characteristics |
| Adult.In2 | Adult.In2 | Sample characteristics |
| Adult.In3 | Adult.In3 | Sample characteristics |
| Adult.In4 | Adult.In4 | Sample characteristics |
| Adult.In5 | Adult.In5 | Sample characteristics |
| Adult.In6 | Adult.In6 | Sample characteristics |
| Adult.In7 | Adult.In7 | Sample characteristics |
| Adult.In8 | Adult.In8 | Sample characteristics |
| Adult.Astrocytes | Adult.Astrocytes | Sample characteristics |
| Adult.Endothelial | Adult.Endothelial | Sample characteristics |
| Dev.Quiescent | Dev.Quiescent | Sample characteristics |
| Dev.Replicating | Dev.Replicating | Sample characteristics |
| Adult.Microglia | Adult.Microglia | Sample characteristics |
| Adult.OtherNeuron | Adult.OtherNeuron | Sample characteristics |
| Adult.OPC | Adult.OPC | Sample characteristics |
| Adult.Oligo | Adult.Oligo | Sample characteristics |
| structure_acronym | StructureAcronym | Sample characteristics |
| ageOnset | ageOnset | Phenotype |
| causeDeath | causeDeath | Phenotype |
| brainWeight | brainWeight | Phenotype |
| height | height | Phenotype |
| weight | weight | Phenotype |
| ageBiopsy | ageBiopsy | Sample characteristics |
| smellTestScore | smellTestScore | Sample characteristics |
| smoker | smoker | Sample characteristics |
| Capstone_4 | Capstone_4 | Sample characteristics |
| NA | Period | Phenotype |
| NA | AgeInterval | Phenotype |
| NA | Regions | Sample characteristics |
A fundamental challenge in the analysis of complex RNA-seq datasets is determining the impact of sources of variation and their relationship with gene expression. To identify these impacts, we used variancePartition, a Bioconductor package that uses a linear mixed model to estimate the proportion of variance explained by selected covariates. Currently, variancePartition has only been applied to the bulk RNA-seq datasets.
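For reference, the quantity estimated by this kind of model can be written compactly. The following is a sketch based on the general form of a linear mixed model, with symbols chosen here for illustration rather than taken from the BITHub pipeline. For a single gene with expression vector $y$, fixed-effect covariates $X_j$ and random-effect covariates $Z_k$:

$$
y = \sum_j X_j \beta_j + \sum_k Z_k \alpha_k + \varepsilon, \qquad \alpha_k \sim \mathcal{N}(0, \sigma^2_{\alpha_k} I), \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2_{\varepsilon} I)
$$

$$
\text{fraction explained by covariate } c = \frac{\hat\sigma^2_{c}}{\sum_j \hat\sigma^2_{\beta_j} + \sum_k \hat\sigma^2_{\alpha_k} + \hat\sigma^2_{\varepsilon}}, \qquad \hat\sigma^2_{\beta_j} = \operatorname{var}(X_j \hat\beta_j)
$$

By construction, the fractions for all covariates plus the residual sum to 1 for each gene, which is what allows covariates to be compared on a common scale.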
In the variancePartition workflow, the contribution of highly correlated covariates is divided between them, so each of these covariates appears to explain a smaller share of the variation. To ensure we select covariates that provide the most useful information about the data, we will also perform a canonical correlation analysis (CCA), which assesses the degree to which variables co-vary and contain the same information. This confirms that the variables selected based on the above correlations are indeed those that provide the most valuable insight into the data.
Prior to running the variancePartition pipeline, we will remove lowly expressed genes, as they would skew the downstream analysis. We will use a generous expression cut-off of 1 RPKM in at least 10% of all samples from BrainSeq. This reduces the number of genes in the expression matrix from 58,037 to 20,452.
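The filter can be expressed in a few lines. The Python sketch below is illustrative rather than the pipeline's own code: it assumes the cut-off is inclusive (a value of exactly 1 RPKM counts as expressed), and the gene names and values are hypothetical.

```python
def filter_low_expression(expression, min_value=1.0, min_fraction=0.10):
    """Keep genes with expression >= min_value in at least min_fraction of samples.

    `expression` maps a gene name to its per-sample values (e.g. RPKM).
    Treating the cut-off as inclusive (>=) is an assumption.
    """
    kept = {}
    for gene, values in expression.items():
        n_expressed = sum(v >= min_value for v in values)
        if n_expressed / len(values) >= min_fraction:
            kept[gene] = values
    return kept

# Toy matrix with 10 samples per gene (hypothetical gene names and values).
expression = {
    "geneA": [1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # expressed in 1 of 10 samples
    "geneB": [0.2, 0.1, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0],  # never reaches the cut-off
    "geneC": [5.0, 6.1, 4.8, 7.2, 5.5, 6.0, 4.9, 5.1, 6.3, 5.8],  # expressed in every sample
}

kept = filter_low_expression(expression)
print(list(kept))
```

With a 10% threshold, a gene only needs to be expressed in a small subset of samples to be retained, which is why the cut-off is described as generous: region-specific or age-specific genes survive the filter.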
Density of genes in BrainSeq with respect to expression before and after filtering for lowly expressed genes
For the BrainSeq Phase II dataset, we want to ensure that the metadata variables we select are the most indicative of the contribution of variation from each category. We will use AgeNumeric, RIN, mito_Rate, rRNA_rate, TotalNReads, MappingRate, StructureAcronym, Sex and Diagnosis. These attributes reflect different aspects of the dataset, and it will be useful for the user to see which of these factors is driving the expression of their gene of interest.
Assessing correlation between covariates of interest from the BrainSeq data
The correlation plot reveals that many of the selected covariates do not correlate highly within their respective category, and therefore we will feed these into the mixed linear model.
variancePartition results for BrainSeq. The plot on the right shows the impact of covariates of interest on randomly selected genes, whereas the plot on the left shows the overall impact of covariates on the expression of all genes.
We removed lowly expressed genes prior to running variancePartition. The same cut-off was applied to BrainSpan as to BrainSeq: genes that did not reach 1 RPKM in at least 10% of all samples were removed, reducing the expression matrix from 52,379 to 19,671 genes.
Density of genes in BrainSpan with respect to expression before and after filtering lowly expressed genes
The selected metadata variables for BrainSpan include AgeNumeric, RIN, mRIN, pH, PMI, StructureAcronym, Regions and Period.
Assessing correlation between covariates of interest from the BrainSpan data
variancePartition results for BrainSpan. The plot on the right shows the impact of covariates of interest on randomly selected genes, whereas the plot on the left shows the overall impact of covariates on the expression of all genes.
Lowly expressed genes were removed from the GTEx expression matrix: genes that exceeded 1 TPM in fewer than 10% of the samples were removed. This reduced the expression matrix from 56,200 to 20,849 genes.
Assessing correlation between covariates of interest from the GTEx data
Assessing correlation between covariates of interest from the PsychEncode data